fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status #9754

Merged

mudler merged 2 commits into master from fix/stale-node-models-cleanup on May 13, 2026

Conversation

@localai-bot
Collaborator

Summary

Stale node_models rows (state="loaded") were outliving the healthy state of their owning node, causing /embeddings (and other inference paths) to dispatch to a backend whose process was gone or drained. The downstream symptom was pgvector rejecting inserts with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)" because the misbehaving backend silently returned a malformed (oversized) tensor; the Models page showed the model as "running" with no associated node (a stale entry), even though the node was no longer visible in the Nodes view.

Why

Repro path observed live:

  • qwen3-embedding-4b was registered on a worker that had been operationally drained (process gone, but node-level heartbeats still arriving).
  • FindAndLockNodeWithModel picked the stale node_models row because the JOIN only filtered by node_models.state = "loaded"; the status check happened in a second query whose failure rolled back the transaction but didn't prevent the same row from being picked next time (see the query sketch after this list).
  • probeHealth (gRPC ping) passed against the drained worker.
  • The backend returned an oversized tensor → /embeddings returned 200 → LocalRecall's pgvector insert blew up at the 16000-dim ceiling.
  • MarkUnhealthy / MarkDraining only flipped backend_nodes.status; unlike MarkOffline, they didn't cascade-clean node_models. Stale rows accumulated.
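
For illustration, the pre-fix shape of that lookup, as a minimal sketch: the table and column names follow the text above, but the concrete schema, SQL, and Postgres-style store are assumptions, not the actual code.

```go
package nodes // hypothetical package name, for illustration only

import "database/sql"

// findLoadedReplica sketches the pre-fix selection: the filter only looks at
// node_models.state, so a row owned by a drained or unhealthy node can still
// be returned; the node-status check happens in a later query, after the row
// has already been picked and locked.
func findLoadedReplica(tx *sql.Tx, modelID string) (nodeID string, err error) {
	err = tx.QueryRow(`
		SELECT nm.node_id
		  FROM node_models nm
		 WHERE nm.model_id = $1
		   AND nm.state = 'loaded'
		 LIMIT 1
		   FOR UPDATE SKIP LOCKED`, modelID).Scan(&nodeID)
	return nodeID, err
}
```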

Changes (defense-in-depth)

  • MarkUnhealthy / MarkDraining cascade-delete node_models — mirrors MarkOffline. On the unhealthy path the backend is presumed unreachable; on drain, routing already filters these out, but clearing the rows stops the UI from misreporting and prevents future scheduling-logic relaxations from picking them up.
  • FindAndLockNodeWithModel inner JOIN now also filters by backend_nodes.status = healthy — the previous version relied on the second node-fetch step to reject non-healthy nodes, but a concurrent reader could still pick the same stale row in that window. Belt-and-braces (see the sketch after this list).
  • DistributedConfig.PerModelHealthCheck renamed to DisablePerModelHealthCheck (inverted; probing on by default) — the per-model gRPC probe independently health-checks each model's gRPC address and removes stale node_models rows when the backend has crashed even though the worker's node-level heartbeat is still arriving.
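
A sketch of the hardened lookup from the second bullet, with the status filter folded into the same locked query (same caveats as the sketch above: schema, SQL, and signature are illustrative assumptions):

```go
package nodes // hypothetical package name, for illustration only

import "database/sql"

// findHealthyLoadedReplica folds the node-status filter into the same locked
// query, so a concurrent reader can no longer pick a node_models row whose
// owning node has stopped being healthy.
func findHealthyLoadedReplica(tx *sql.Tx, modelID string) (nodeID string, err error) {
	err = tx.QueryRow(`
		SELECT nm.node_id
		  FROM node_models nm
		  JOIN backend_nodes bn ON bn.id = nm.node_id
		 WHERE nm.model_id = $1
		   AND nm.state  = 'loaded'
		   AND bn.status = 'healthy'
		 LIMIT 1
		   FOR UPDATE OF nm SKIP LOCKED`, modelID).Scan(&nodeID)
	return nodeID, err
}
```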

Migration

DistributedConfig.PerModelHealthCheck was renamed and inverted to DisablePerModelHealthCheck. The field had no CLI flag, env var binding, or YAML key in tree (only the bare struct field), so there's no user-facing migration. Anything constructing DistributedConfig in code needs to drop the assignment (default now does the right thing) or invert it.
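
Roughly, for code constructing the struct (the field name comes from this PR; the surrounding struct and helper are illustrative only):

```go
package config // hypothetical package name, for illustration only

// Sketch only: the zero value of the renamed field means "probing enabled",
// so code that previously set PerModelHealthCheck: true can simply drop the
// assignment.
type DistributedConfig struct {
	DisablePerModelHealthCheck bool // was PerModelHealthCheck; meaning inverted
	// ... other fields elided
}

func perModelProbingEnabled(cfg DistributedConfig) bool {
	return !cfg.DisablePerModelHealthCheck // inverted at the call site
}
```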

Test plan

  • go test ./core/services/nodes/ (100s, all pass)
  • go vet ./core/services/nodes/ ./core/config/ ./core/application/ (clean)
  • Live verification in the dev cluster where the original symptom was observed: drain a backend hosting an embedding model, confirm /embeddings either returns an error (no replica) or routes to a healthy one — no 200-with-garbage, no pgvector dimension overflow downstream.

🤖 Generated with Claude Code

mudler added 2 commits May 11, 2026 08:24
… routing by healthy status

Stale node_models rows (state="loaded") were surviving past the healthy
state of their owning node, causing /embeddings (and other inference
paths) to dispatch to a backend whose process was gone or drained. The
downstream symptom in a live cluster was pgvector rejecting inserts
with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)"
because the misbehaving backend silently returned a malformed
(oversized) tensor; the Models page showed the model as "running"
without an associated node, like a stale entry, even though the node
was no longer visible in the Nodes view.

Three changes here, plus a hardening change in a follow-up commit:

- MarkDraining now cascade-deletes node_models rows for the affected
  node, mirroring MarkOffline. Drains are explicit operator actions —
  the box has been intentionally taken out of rotation — so clearing
  the rows stops the Models UI from misreporting and prevents the
  routing layer from picking those rows if scheduling logic is ever
  relaxed. In-flight requests already hold their gRPC client through
  Route() and finish normally; the only observable effect is a
  non-fatal IncrementInFlight warning, acceptable for a drain. (A sketch
  of this cascade follows, after this list.)

  MarkUnhealthy is deliberately left status-only: it fires from
  managers_distributed / reconciler on a single nats.ErrNoResponders
  with no retry, so a transient NATS hiccup must not nuke every loaded
  model and force a full reload on recovery.

- FindAndLockNodeWithModel's inner JOIN now filters on
  backend_nodes.status = healthy in addition to node_models.state =
  loaded. The previous version relied on the second node-fetch step to
  reject non-healthy nodes, but a concurrent reader could still pick
  the same stale row in the same window. Belt-and-braces.

- DistributedConfig.PerModelHealthCheck renamed to
  DisablePerModelHealthCheck and inverted at the call site so
  per-model gRPC probing is on by default. The probe (now made
  consecutive-miss aware in a follow-up commit) independently health-
  checks each model's gRPC address and removes stale node_models rows
  when the backend has crashed even though the worker's node-level
  heartbeat is still arriving.

  Migration: the field had no CLI flag, env var binding, or YAML key
  in tree (only the bare struct field), so there is no user-facing
  migration. Anything constructing DistributedConfig in code needs to
  drop the assignment (default now does the right thing) or invert it.
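
For reference, a sketch of the drain-path cascade from the first bullet
above (store API, schema, and SQL are assumptions and may not match the
real persistence layer):

```go
package nodes // hypothetical package name, for illustration only

import (
	"context"
	"database/sql"
)

// markDraining flips the node's status and, mirroring MarkOffline, removes
// its node_models rows in the same transaction, so neither the Models UI nor
// the router keeps seeing replicas that can no longer serve traffic.
func markDraining(ctx context.Context, db *sql.DB, nodeID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	if _, err := tx.ExecContext(ctx,
		`UPDATE backend_nodes SET status = 'draining' WHERE id = $1`, nodeID); err != nil {
		return err
	}
	// Cascade-clean the replicas owned by this node.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM node_models WHERE node_id = $1`, nodeID); err != nil {
		return err
	}
	return tx.Commit()
}
```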

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…emoves a row

The per-model gRPC probe used to remove a node_models row on a single
failed health check. With the per-model probe now on by default, that
made any 5-second gRPC blip (network jitter, a long-running request
hogging the worker's gRPC server thread, brief GC pause) trigger a
full reload of the affected model — too eager for production.

Require perModelMissThreshold (3) consecutive failed probes before
removal. At the default 15s tick a model must be unreachable for ~45s
before reap; a single successful probe in between resets the streak.
Per-(node, model, replica) state tracked under a mutex on the monitor.

If the removal call itself fails, the miss counter is left in place
so the next tick retries rather than starting the streak over.
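
A compact sketch of that bookkeeping (the threshold value and the
(node, model, replica) key come from this message; the type and method
names are made up for illustration):

```go
package nodes // hypothetical package name, for illustration only

import "sync"

const perModelMissThreshold = 3 // consecutive failed probes before removal

// replicaKey identifies one probed replica: (node, model, replica).
type replicaKey struct{ node, model, replica string }

type healthMonitor struct {
	mu     sync.Mutex
	misses map[replicaKey]int
}

// newHealthMonitor initializes the misses map so direct construction in tests
// cannot nil-map-panic in the probe path.
func newHealthMonitor() *healthMonitor {
	return &healthMonitor{misses: map[replicaKey]int{}}
}

// recordProbe updates the consecutive-miss streak for one replica and reports
// whether the caller should remove its node_models row on this tick. A single
// success resets the streak; the counter is only cleared via clearMisses once
// the removal itself has succeeded, so a failed removal retries next tick
// instead of starting the streak over.
func (m *healthMonitor) recordProbe(k replicaKey, probeOK bool) (remove bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if probeOK {
		delete(m.misses, k)
		return false
	}
	m.misses[k]++
	return m.misses[k] >= perModelMissThreshold
}

// clearMisses resets the streak after a successful removal.
func (m *healthMonitor) clearMisses(k replicaKey) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.misses, k)
}
```

The probe loop would call recordProbe on every tick and clearMisses only
after the removal call has succeeded.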

Tests:
- removes stale model via per-model health check after consecutive
  failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a
  success (covers the reset-on-success path and verifies the counter
  reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct
  test helpers don't nil-map-panic in the probe path

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler force-pushed the fix/stale-node-models-cleanup branch from 78fb08b to 9029ec2 on May 11, 2026 08:37
mudler merged commit b4fdb41 into master on May 13, 2026
54 checks passed
mudler deleted the fix/stale-node-models-cleanup branch on May 13, 2026 19:57
mudler added the bug label on May 13, 2026